Part One - Project Based

CONTEXT: The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes.

PROJECT OBJECTIVE: The goal is to cluster the data, treat each cluster as an individual dataset, and train regression models on them to predict mpg.

Importing the libraries

Import and warehouse data

Observations:

The datasets were combined correctly: the final dataset has 398 rows and 9 columns.

The combined dataframe has been exported to the local machine as .csv, .xlsx, and .json, so any of these formats can be used to re-import the data for future use and analysis.
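A minimal sketch of such an export and re-import round trip. The small dataframe, its column names, and the temporary output folder are placeholders, not the project's actual data or paths.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the combined dataframe (the real one has 398 rows, 9 columns).
combined = pd.DataFrame(
    {"car_name": ["car a", "car b", "car c"], "mpg": [18.0, 15.0, 26.5], "cyl": [8, 8, 4]}
)

out_dir = Path(tempfile.mkdtemp())  # placeholder output folder

# Write the same frame in two formats. An .xlsx export works the same way via
# combined.to_excel(path, index=False) when an Excel engine such as openpyxl is installed.
combined.to_csv(out_dir / "combined.csv", index=False)
combined.to_json(out_dir / "combined.json", orient="records")

# Read each file back and confirm the shape survived the round trip.
from_csv = pd.read_csv(out_dir / "combined.csv")
from_json = pd.read_json(out_dir / "combined.json", orient="records")
print(from_csv.shape, from_json.shape)
```

Passing index=False keeps the row index out of the file, so the re-imported frame has the same columns as the original.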

Data Cleansing

Observations:

Data Analysis and Visualisation

Observations:

From describe() and the visual analysis we could observe:

Observations:

Machine Learning

K-Means

Observations:

K = 4

K = 6

Hierarchical Clustering

In hierarchical clustering we do not fix the number of clusters in advance. Instead, we build the full hierarchy with several linkage methods, use the cophenetic correlation coefficient to pick the linkage that best preserves the original pairwise distances, and then use the dendrogram to decide how many clusters to keep.
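The workflow described here can be sketched with SciPy as follows. The data is synthetic, and the candidate linkage methods and the final cut at 2 clusters are illustrative assumptions, not the choices made in this project.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Synthetic stand-in data: two well-separated blobs in 2-D.
X = np.vstack([rng.normal(0.0, 0.5, (30, 2)), rng.normal(5.0, 0.5, (30, 2))])

pairwise = pdist(X)  # condensed Euclidean distance matrix

# Build one hierarchy per linkage method and score it with the
# cophenetic correlation coefficient (closer to 1 is better).
scores = {}
for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)
    coef, _ = cophenet(Z, pairwise)
    scores[method] = coef

best = max(scores, key=scores.get)

# Cut the best tree into a chosen number of flat clusters
# (2 here; in practice this is read off the dendrogram).
labels = fcluster(linkage(X, method=best), t=2, criterion="maxclust")
print(best, np.bincount(labels)[1:])
```

Note that fcluster numbers the clusters from 1, which is why the first bin of the count is dropped.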

Observations:

Hierarchical Clustering | K-Means Clustering
connectivity-based clustering | centroid-based clustering
higher computation time | lower computation time
number of clusters can be chosen after the hierarchy is built | number of clusters must be fixed when the model is built
uses a dendrogram to decide on the number of clusters | uses the elbow method to decide on the number of clusters
computation time grows at least quadratically with data size | computation time grows roughly linearly with data size
uses agglomerative or divisive algorithms | uses Lloyd's algorithm
can use any distance and linkage measures | uses the Euclidean distance measure
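The elbow method named in the comparison above can be sketched as follows. The data is synthetic with three blobs, and the range of k values is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic stand-in data: three blobs, so the elbow should appear near k = 3.
centers = np.array([[0, 0], [6, 6], [0, 6]])
X = np.vstack([rng.normal(c, 0.6, (50, 2)) for c in centers])

# Inertia is the within-cluster sum of squared distances to the centroid.
# It generally keeps dropping as k grows, so we look at how fast it drops.
inertias = []
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

# The "elbow" is the k after which adding another cluster stops paying off.
drops = -np.diff(inertias)
for k, drop in zip(range(2, 8), drops):
    print(f"k={k}: inertia reduced by {drop:.1f}")
```

Plotting inertias against k gives the usual elbow chart; the printed reductions carry the same information without a plotting dependency.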

Observations:

In my opinion, 4 seems to be the optimal number of clusters because:

Linear Regression on data set without clusters

Linear Regression on data set with K-Means clusters
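A minimal sketch of the idea, assuming the clusters come from K-Means on the features and one regression is fitted per cluster. The data is synthetic, and the comparison is in-sample only; a real evaluation would use a held-out test split.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
# Synthetic stand-in: two groups of rows whose target follows a different line in each.
x_low = rng.uniform(0, 1, (100, 1))
x_high = rng.uniform(5, 6, (100, 1))
X = np.vstack([x_low, x_high])
y = np.concatenate([2.0 * x_low[:, 0] + 1.0, -3.0 * x_high[:, 0] + 40.0])
y += rng.normal(0, 0.05, y.shape)


def rmse(pred):
    """Root mean squared error against the true target."""
    return float(np.sqrt(np.mean((y - pred) ** 2)))


# Baseline: a single regression over all rows.
global_pred = LinearRegression().fit(X, y).predict(X)

# Cluster the features, then fit a separate regression inside each cluster.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
cluster_pred = np.empty_like(y)
for c in np.unique(labels):
    mask = labels == c
    model = LinearRegression().fit(X[mask], y[mask])
    cluster_pred[mask] = model.predict(X[mask])

print(f"single model RMSE: {rmse(global_pred):.3f}")
print(f"per-cluster RMSE:  {rmse(cluster_pred):.3f}")
```

Per-cluster models help only when the relationship between features and target really differs between clusters, as it does by construction here.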

Observations:

Improvements:

---------------------------------------------------------------------------------------------------------------

Part Two - Project Based

CONTEXT: Company X curates and packages wine across various vineyards spread throughout the country.

PROJECT OBJECTIVE: The goal is to build a synthetic data generation model using the existing data provided by the company.

Importing the necessary libraries

Importing the data

Observations:

Analysis and Visualisation

Observations:

From the pairplot and unique() on the Quality feature, we expect the data to form two clusters.

K-Means Clustering
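One way K-Means with two clusters could be used to generate Quality values is sketched below. It assumes that some Quality entries are missing and that each cluster is mapped to its most common known label; the text does not confirm either point. The column names, label values, and data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
# Synthetic stand-in for the wine data; columns A, B and the labels are hypothetical.
n = 60
quality = np.array(["Quality A"] * n + ["Quality B"] * n)
features = np.vstack([rng.normal(2.0, 0.4, (n, 2)), rng.normal(8.0, 0.4, (n, 2))])
wine = pd.DataFrame(features, columns=["A", "B"])
wine["Quality"] = quality
wine.loc[::10, "Quality"] = np.nan  # assumption: some labels are missing

# Cluster on the standardised chemical features only.
scaled = StandardScaler().fit_transform(wine[["A", "B"]])
wine["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)

# Map each cluster to the most common known label inside it.
known = wine.dropna(subset=["Quality"])
cluster_to_label = known.groupby("cluster")["Quality"].agg(lambda s: s.mode().iloc[0])

# Derive a label for every row from its cluster, then fill only the gaps.
wine["Quality_filled"] = wine["Quality"].fillna(wine["cluster"].map(cluster_to_label))
print(pd.crosstab(wine["cluster"], wine["Quality_filled"]))
```

The crosstab is a quick check that each cluster lines up with a single Quality value before the generated labels are trusted.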

Observations:

---------------------------------------------------------------------------------------------------------------

Part Three - Project Based

CONTEXT: The purpose is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.

PROJECT OBJECTIVE: Apply a dimensionality reduction technique (PCA) and train a model on the principal components instead of training it on the raw data alone.

Importing the libraries

Import data

Data Analysis and Visualisation

Observations:

Observations:

From describe() and the visual analysis we could observe:

Observations:

PCA and Machine Learning Models

SVM with all attributes

PCA
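A minimal sketch of how the number of principal components might be chosen from the cumulative explained variance. The data is synthetic and the 95 % threshold is an assumed cut-off, not one taken from this project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
# Synthetic stand-in: 6 observed columns driven by 2 underlying factors plus noise.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 6))

# PCA is sensitive to scale, so standardise the columns first.
X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)

cumulative = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components whose cumulative explained variance reaches 95 %.
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(np.round(cumulative, 3), n_keep)
```

The same cumulative curve is what a scree or explained-variance plot shows.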

Observations:

Dimensionality Reduction

SVM with only transformed attributes
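The comparison between an SVM on all attributes and an SVM on the PCA-transformed attributes can be sketched as below. The data comes from make_classification rather than the silhouette dataset, and the feature count, default SVC settings, and 95 % variance threshold are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: 18 columns, 3 classes, with deliberately redundant features.
X, y = make_classification(
    n_samples=600, n_features=18, n_informative=6, n_redundant=8,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Model 1: SVM on all standardised attributes.
svm_raw = make_pipeline(StandardScaler(), SVC()).fit(X_train, y_train)

# Model 2: the same SVM, fed only the components that keep 95 % of the variance.
svm_pca = make_pipeline(StandardScaler(), PCA(n_components=0.95), SVC()).fit(X_train, y_train)

n_components = svm_pca.named_steps["pca"].n_components_
acc_raw = svm_raw.score(X_test, y_test)
acc_pca = svm_pca.score(X_test, y_test)
print(f"all attributes: {acc_raw:.3f}")
print(f"PCA ({n_components} components): {acc_pca:.3f}")
```

Putting the scaler and PCA inside the pipeline means both are fitted on the training split only, which avoids leaking test data into the transformation.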

Observations:

---------------------------------------------------------------------------------------------------------------

Part Four - Project Based

CONTEXT: Company X is a sports management company for international cricket.

PROJECT OBJECTIVE: The goal is to build a data-driven batsman ranking model that the sports management company can use to make business decisions.

Importing the libraries

Data Import

Data Analysis and Visualisation

Observations:

Observations:

Observations:

Machine Learning model

PCA

Observations:

Dimensionality Reduction

Observations:

K-Means Clustering with transformed variables
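A minimal sketch of clustering players on PCA-transformed variables and turning the clusters into ranking tiers. The player names, columns, and numbers are made up, and ordering the clusters by mean runs is an assumed ranking rule, not necessarily the one used here. K = 4 matches the value named in this part.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical batting statistics; names, columns and numbers are for illustration only.
stats = pd.DataFrame(
    {
        "Runs": [733, 590, 495, 422, 305, 260, 120, 98],
        "Ave": [61.1, 45.4, 38.1, 35.2, 27.7, 23.6, 12.0, 9.8],
        "SR": [160.7, 145.3, 138.2, 129.5, 125.0, 118.3, 101.2, 95.4],
    },
    index=[f"player_{i}" for i in range(1, 9)],
)

# Standardise, then project onto the first two principal components.
components = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(stats))

# Cluster the players in the reduced space.
stats["cluster"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(components)

# Order the clusters by mean runs so that tier 1 is the strongest group.
order = stats.groupby("cluster")["Runs"].mean().sort_values(ascending=False).index
stats["tier"] = stats["cluster"].map({c: i + 1 for i, c in enumerate(order)})
print(stats.sort_values("tier"))
```

K-Means cluster ids are arbitrary, so the explicit mapping from cluster to tier is what gives the grouping a ranking meaning.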

Observations:

K=4

Conclusions:

---------------------------------------------------------------------------------------------------------------

Part Five - Question Based

Dimensionality Reduction Techniques in Python

Feature Selection:

Component/Factor Based:

Projection Based:
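The three families listed above can each be illustrated with one scikit-learn estimator. The picks here (VarianceThreshold, PCA, Isomap) are representative examples chosen for this sketch, not necessarily the techniques discussed in the answer, and the digits images bundled with scikit-learn stand in for real data.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.manifold import Isomap

# 8x8 digit images bundled with scikit-learn, flattened to 64 pixel features.
X, _ = load_digits(return_X_y=True)
X = X[:500]  # small subset to keep the sketch fast

# Feature selection: keep original columns, dropping those with zero variance.
selected = VarianceThreshold(threshold=0.0).fit_transform(X)

# Component based: new features are linear combinations of all columns.
components = PCA(n_components=10).fit_transform(X)

# Projection based: embed the data in a low-dimensional manifold.
embedded = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

print(X.shape, selected.shape, components.shape, embedded.shape)
```

Feature selection keeps interpretable original columns, while the other two families produce new derived features.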

Using Dimensionality Reduction in Multimedia (images, videos)

Importing the libraries

Observations:

GaussianNB with original data

Dimensionality reduction using Projection Based methods

GaussianNB model with reduced features
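A minimal sketch comparing GaussianNB on the original pixels with GaussianNB on reduced features. Isomap is used as the projection-based method and the scikit-learn digits images as the dataset; both are assumptions, since the text names neither, and the neighbour and component counts are arbitrary.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# 8x8 digit images flattened to 64 features each.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Baseline: Gaussian Naive Bayes on all 64 pixel features.
nb_original = GaussianNB().fit(X_train, y_train)

# Reduced: project to 10 dimensions with Isomap, then fit the same classifier.
nb_reduced = make_pipeline(
    Isomap(n_neighbors=10, n_components=10), GaussianNB()
).fit(X_train, y_train)

acc_original = nb_original.score(X_test, y_test)
acc_reduced = nb_reduced.score(X_test, y_test)
print(f"original (64 features): {acc_original:.3f}")
print(f"reduced (10 features):  {acc_reduced:.3f}")
```

Because Isomap implements transform, the pipeline can project unseen test images with the embedding learned from the training split.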

Observations: